home *** CD-ROM | disk | FTP | other *** search
- IEN 120
-
-
-
-
-
-
- INTERNET ROUTING AND THE NETWORK PARTITION PROBLEM
- IEN #120
- PRTN #279
- Radia Perlman
- Bolt Beranek and Newman, Inc.
- October, 1979
-
-
- I. INTRODUCTION
-
- As described in IEN 110, "Internet Addressing and Naming in a
- Tactical Environment", a network can become partitioned into two
- or more pieces. Assuming some of these pieces are still
- connected to the catenet, we would like the catenet to be able to
- efficiently deliver packets to a host in any such piece. Such a
- capability in the catenet could additionally be utilized by a
- scheme for delivering intranet traffic across partitions in a
- partitioned network.
-
- There are four parts to the solution:
- 1) detecting that a network is partitioned
- 2) deriving a name for each partition
- 3) figuring out which partition a host is in
- 4) routing packets to the correct partition
-
- The currently implemented gateway routing algorithm is based on
- the original ARPANET algorithm. To efficiently provide for
- routing to network partitions, routing must be based on a link
- state routing scheme. I will demonstrate this after first
- presenting the design (parts II-VIII), then showing what would be
- involved in modifying the original ARPANET algorithm for that
- purpose (part IX), and then comparing the two approaches (part
- X).
-
-
- II. TERMINOLOGY
-
- 1) neighbor gateways--two gateways attached to the same network
-
- 2) functioning neighbor gateways--neighbor gateways able to
- communicate with each other over their common network
-
- 3) attached network--a network physically attached to a gateway,
- and with which the gateway can communicate directly (not through
- another gateway)
-
- 4) neighbor network of gateway G--an attached network of a
- functioning neighbor gateway of G, excluding attached networks of
- G.
-
-
- - 1 -
-
-
-
- III. TABLES TO BE MAINTAINED BY EACH GATEWAY
-
- 1) a list of attached networks--This list is relatively constant
- and is updated by a gateway when it notices a network interface
- is down or for some other reason the gateway is incapable of
- communicating with an attached network. Keeping this table
- updated is solely the responsibility of each gateway, and does
- not require intergateway communication.
-
- 2) a table of all gateways and their attached networks--This
- table is maintained by intergateway communication -- gateways
- give copies of their table 1 to all other gateways. The table of
- all gateways never shrinks (a down gateway is assumed to exist
- but be unreachable).
-
- 3) a table of link states to neighbor gateways--This table in
- gateway G specifies, for each neighbor gateway G1, over which
- common networks G and G1 can communicate. This table is updated
- by G periodically bouncing packets off each neighbor gateway from
- which it has not recently received traffic. Note that I refer to
- two gateways as neighbor gateways even if they cannot
- (temporarily, hopefully) communicate with each other.
-
- 4) a list of neighbor networks--This list is derived from the
- table of link states to neighbor gateways and the list of
- gateways with attached networks (tables 3 and 2).
-
- 5) total link state--This is a table of all gateways and the
- state of their links to their neighbor gateways. This table is
- compiled from intergateway communication. When a gateway notices
- that its table of attached networks, or its table of link states
- to neighbor gateways (tables 2 and 3) changes, that gateway
- efficiently broadcasts this information to all other gateways in
- the catenet. To minimize numbers of reports when a link is
- flaky, a link on an attached network must be up continuously for
- some amount of time before its state is considered to change from
- down to up and trigger a link state report.
-
- 6) shortest distance matrix--This is a data structure from which
- routing decisions can be made directly. It is computed from the
- other tables. It is described more fully in part IV.
-
-
- IV. ROUTING COMPUTATION
-
- A gateway, using the tables described above, constructs a
- connectivity matrix whose rows and columns represent networks,
- and whose entries are 1 if any gateways claim to be attached to
- both networks, and infinity otherwise. Then the gateway *'s that
- matrix to construct a shortest distance matrix. (The operation
- "*" consists of "multiplying" a matrix by itself, using the
- operations min and plus instead of plus and times, until the
- result stabilizes. This is a well-known algorithm.) The gateway
-
-
- - 2 -
-
-
-
- then looks in the shortest distance matrix for the neighbor
- network (or set of such) closest to the destination network, and
- chooses a functioning neighbor gateway (or set of such) attached
- to that neighbor network, to forward packets to for that
- destination network.
-
- When a link state report changes the state of an entry in the
- connectivity matrix (remember, all gateways connecting two
- networks have to go down before a 1 changes to infinity), a
- gateway must recompute the distance matrix.
-
- This design is a slight modification of the design presented in
- "Gateway Routing", by Radia Perlman (PRTN #242, PSPWN #99). The
- modification is that the indices of the matrix are networks, not
- gateways. The purpose of this modification is to make the size
- of the matrix smaller, an important modification given that in
- the catenet there are many more gateways than networks. There
- are aspects to the scheme that are irrelevant to a discussion of
- how to solve the network partition problem, such as sequence
- numbers for link state reports, etc. The purpose of this paper
- is to direct a correct approach to the design, and not to present
- an implementation specification. Thus an implementer should read
- PRTN 242 to discover the details of a link state algorithm that
- were not relevant for presentation here.
-
- Note that an alternative to *'ing the matrix is to use the scheme
- that the ARPANET has switched over to, which is a link state
- scheme in which a shortest path routing tree is constructed from
- the connectivity information. The new ARPANET scheme is less
- costly to maintain as links change state. Its disadvantages are
- that it precludes load splitting, probably a very important
- problem in the case of the catenet, and is probably a little
- harder to implement. Since links will not change state very
- often, the author favors the overhead of the matrix *'ing scheme
- over the disadvantages of the ARPANET scheme. However, this
- decision is separable from the rest of the design and can be
- decided either way at a later time.
-
-
- V. DETECTING THAT A NETWORK HAS PARTITIONED
-
- Now we look at the problem of network partitions. In the design
- presented so far there is enough information for any gateway to
- detect a partitioned network and to isolate groups of gateways on
- each partition: A gateway G knows that network N is partitioned
- if there are two sets of gateways, set Q and set R, such that all
- gateways in both sets report they are attached to network N, but
- there are no two-way links between a member of set Q and a member
- of set R via network N. This information is derived
- independently by each gateway from the table of all gateways and
- their attached networks, and from the table of total link state
- (tables 2 and 5).
-
-
-
- - 3 -
-
-
-
- VI. DERIVING A NAME FOR EACH PARTITION
-
- It is necessary to expand the internet header to allow a field
- for identifying a network partition. The reason for this is to
- avoid the necessity for every gateway on a packet's route to
- discover to which partition the packet should be sent.
-
- The partition name must give sufficient information so that every
- gateway can make the proper routing decisions to send a packet to
- that partition, based on its tables of total link state and
- gateways/attached nets (tables 5 and 2).
-
- The following schemes for naming a partition are all done
- independently by all gateways, as opposed to having some central
- authority choose a name and inform all gateways, or having a
- group of gateways decide on a name "by committee".
-
- One method of identifying a partition is to use the name of any
- member gateway of the partition. It will not matter if two
- gateways choose different names for the same partition. Since
- the sets of gateways involved in the network partitions are
- disjoint, any member of the set identifies the set.
-
- Another method is to list (either by an explicit list or a bit
- table) the set of gateways that make up that partition. This is
- unnecessarily descriptive, since the list of gateways is
- derivable from a single member of the set. And it is a less
- robust scheme, because any change to the partition (a gateway
- going down, coming up, or the net partitioning into more pieces)
- can confuse a gateway trying to route to that set of gateways.
- In the first method, if the partition changes, the packet will be
- routed unambiguously to whatever partition the named gateway is
- in. Of course, if the named gateway goes down, the packet
- becomes undeliverable, but that is easier to deal with than
- trying to deliver a packet to a set of gateways that overlaps two
- partitions.
-
- A third method is for each gateway to number partitions from 1 to
- the number of partitions, ordered by, say, the highest numbered
- gateway in each partition. This method uses fewer bits in the
- packet header but is a much less robust scheme. With gateways
- having slightly differing information, partition names have
- different meanings. Also, partitions can switch names suddenly.
- For instance, a net can be partitioned into 2 pieces, numbered 1
- and 2, and, assuming the highest numbered gateway was down, and
- comes up in partition 2, partitions 1 and 2 now switch
- identities.
-
- Thus the recommended method of identifying a partition is the
- first method.
-
-
-
-
-
- - 4 -
-
-
-
- VII. FIGURING OUT WHICH PARTITION A HOST IS IN
-
- Now we will examine several schemes for having the correct
- partition identified in a packet. It is the responsibility of
- either the source host or first gateway to do this. By examining
- the alternative schemes we can also determine whose
- responsibility it should be.
-
- a) Source host determines correct partition by trial and error --
- The source host does not know about the structure of the catenet
- and does not know that the destination net is partitioned. When
- it sends a packet to that net with no partition name filled in,
- the first gateway to receive the packet sends back a message that
- that network is partitioned, and lists the partition names.
- Assuming there are k partitions, the source host sends k packets
- requiring ACKs to the destination, each packet addressed to a
- different partition. The packet that receives an ACK is the one
- addressed to the correct partition.
-
- If a gateway receives a packet with an incorrectly filled in
- partition name field, that gateway will send back the same kind
- of notification as for a packet with a blank field -- it will
- notify the host that the net is partitioned and list the
- partition names, or if the net is no longer partitioned, give
- that information.
-
- If the source host is sending packets that require
- acknowledgments, it will notice quickly if its packets stop
- getting successfully delivered to the destination. Then it can
- redetermine the host's partition.
-
- b) The first gateway, using trial and error -- If it is the first
- gateway that has the responsibility, it can do the same thing as
- the source host in scheme a, sending packets to the destination
- addressed to each partition to discover from which partition it
- receives an ACK. Since a network is unlikely to be partitioned
- into very many pieces, it is not costly to try all partitions.
- Either the correct partition will be found or no ACK will return
- (in which case presumably the host is down or the network is
- partitioned in such a way that some hosts are unreachable from
- all gateways). The disadvantage of having the first gateway do
- the work in this scheme is that a gateway does not know whether
- packets it is forwarding successfully reach their destination.
- Thus it must either keep a cache of host/partition
- correspondence, which can be out of date for some amount of time
- during which the gateway will misaddress packets to a
- destination, or the gateway must rediscover the correct partition
- on a packet by packet basis, which is of course unacceptably
- expensive. Also, assuming it is common for a source host to
- split its traffic among several gateways on the source net, after
- a gateway discovers the correct partition for a destination host
- it should inform all other gateways on the source net of the
- correct partition, to prevent the necessity of them rediscovering
- that fact.
-
- - 5 -
-
-
-
- c) gateways on a partitioned net could keep track of
- host/partition correspondence for their net -- Another method is
- for gateways on a partitioned net to find out which hosts they
- can reach, and exchange that information with the other gateways
- on that partitioned network. Then a gateway could respond more
- intelligently to a packet addressed to the incorrect partition by
- sending back a message giving the correct partition (to the
- packet source if that is who fills in the partition field in the
- packet header, or to all gateways on the source net otherwise).
- In addition, a gateway on the partitioned network can forward the
- misaddressed packet to the correct partition.
-
- This method requires gateways on the partitioned network either
- to keep a complete list of the hosts on the net, marked as to
- partition, or to keep a cache of hosts, adding hosts to the cache
- by querying the gateways on other partitions at the time the
- necessity of locating that host arises. In the complete list
- case, gateways on a partitioned net would periodically send
- packets requiring ACKs to all hosts on that net in order to keep
- their lists up-to-date. In the cache case, gateways would poke a
- host only when the need to know its location arose (when the
- gateway received a packet for that host, and the host was not
- already in its cache, or when a query from a gateway on a
- different partition of the net arrived, asking for that host's
- location).
-
- This method suffers from the same problem as method b, with the
- first gateway having responsibility for determining
- host/partition correspondence -- the tables in the gateways on
- the partitioned net can become out of date, during which time
- they will misdirect traffic, and they cannot constantly be
- checking their tables.
-
- Thus I recommend method a, having the source host fill in the
- partition field using the trial and error method of discovering
- host/partition correspondence.
-
-
- VIII. ROUTING PACKETS TO THE CORRECT PARTITION
-
- As stated above, a gateway G, distant from partitioned network N,
- must know which gateways are involved in a partition before G can
- correctly route a packet -- it might have to make a different
- routing decision for one partition than for another one.
-
- When G detects a network has become partitioned into n pieces, G
- must add n-1 rows and columns to its shortest distance matrix,
- i.e., it treats each partition as a separate network. It is an
- implementation detail, and not a difficult one, to ensure that
- the gateway understands the meaning of each row and column. And
- given that the gateway understands the meaning of each row and
- column, it is easy for it to fill in the connectivity matrix from
- its table of total link state. The computation is done exactly
- as in the nonpartitioned case.
-
- - 6 -
-
-
-
- IX. MODIFYING THE ORIGINAL ARPANET ROUTING FOR PARTITIONS
-
- The original ARPANET routing is the currently implemented routing
- algorithm in the gateways. The basic design is that gateways
- report their distance vector to all their neighbor gateways
- (their distance vector gives their distance to all destination
- nets). They derive their distance vector from their neighbors'
- distance vectors. (A gateway's distance to a destination net is 0
- if the gateway is directly attached to the destination net.
- Otherwise, it is 1 hop further than the neighbor closest to the
- destination.)
-
- The major modifications that are necessary to handle partitioning
- are:
-
- 1) Currently distance vectors are just a list of numbers, and
- gateways have an assembled-in offset/net number correspondence.
- Thus the vectors do not need labels for each entry. If networks
- became partitioned, more destinations would need to be reported
- in the distance vector. Either some (very complicated)
- negotiation process would need to be carried out so that all
- gateways would agree, when nets became more or less partitioned,
- on a new offset/net number correspondence, or the distance
- vectors would need labels identifying the destination whose
- distance is being reported. The problems associated with a
- negotiation process make that scheme unworkable. Thus we can
- assume the vectors would be expanded to have an identifying label
- for each destination. The label would include net number and
- partition name.
-
- 2) Gateways do not have global knowledge of the structure of the
- catenet, in contrast to a link state scheme. Thus it is the
- responsibility of the gateways on a partitioned network to notice
- that the net has become partitioned and start a routing update.
-
- In the current implementation, there is no way for gateways on a
- partitioned net to tell the difference between having their net
- partitioned and having several gateways on their net go down,
- since they do not receive information about individual gateways
- -- they only receive distance vectors from their neighbors. They
- will no longer receive distance vectors from their neighbors on a
- partitioned net, or from neighbors who have gone down, so lack of
- response from neighbors does not distinguish between dead
- neighbors and a partitioned network.
-
- Thus either distance vectors would have to contain information
- about all catenet gateways (which adds a great deal of overhead
- since there are many more gateways than nets, and the only
- purpose of doing that is to detect partitions) or gateways on a
- network would report that the network has become partitioned
- every time a gateway goes down.
-
-
-
-
- - 7 -
-
-
-
- 3) Gateways in a partition must agree on a partition name, since
- if two of them started a routing update with two different names
- for the same partition, the rest of the catenet can draw no
- conclusion except that the two partition names refer to distinct
- destination partitions. Agreeing on a name is not that easy. If
- some simple algorithm is chosen, such as highest numbered gateway
- in that partition, the name of a partition can change. Suppose
- the old partition name was 5 and it changes to 12. A source host
- (or distant gateway) has gone through the overhead of determining
- that the proper partition for a destination host was 5. When the
- name of the partition changes, this overhead must be repeated.
- Also, when the name of a partition changes, the rest of the
- gateways on the catenet must be informed of that fact so that
- they will stop reporting about obsolete partition names in their
- distance vectors.
-
-
- X. COMPARISON OF LINK STATE AND ORIGINAL ARPANET SCHEMES
-
- The link state scheme is far more robust. Because gateways have
- global knowledge, routing is more likely to proceed calmly while
- routing updates are percolating throughout the catenet.
- Partition names are not as important in the link state scheme --
- gateways do not have to agree on a single name for a partition.
-
- As stated above, because in the currently implemented scheme
- gateways report only their distances to destination networks, and
- not to individual gateways, either gateways would report network
- partitions whenever gateways went down, or the distance vectors
- would have to be expanded to include reports about all gateways.
- This is a further disadvantage of the original ARPANET scheme to
- this application.
-
- Another disadvantage of the original ARPANET routing, not related
- to partitioning, is that, because nodes do not have global
- knowledge of network connectivity, there are types of routing
- loops which they cannot distinguish from degradation of best
- routes due to connectivity changes. As currently implemented in
- the catenet, nodes report their distance to a destination as
- "infinity" (a number higher than the maximum possible distance in
- the catenet) when reporting to downstream neighbors. This fixes
- many kinds of routing loops. However, neither this scheme nor
- any variant (such as hold-down, the scheme chosen by the ARPANET
- as a modification of the original algorithm) can distinguish all
- kinds of routing loops from connectivity changes. Thus there are
- cases when a group of nodes will have to count up their distance
- to a destination until it reaches "infinity" before discovering
- the destination is unreachable. This does not make the scheme
- unworkable for the current catenet, since the longest possible
- path in the catenet is less than 10 hops. However, it is again a
- further disadvantage of the original ARPANET scheme.
-
-
-
-
- - 8 -
-
-
-
- Another important consideration is the link state scheme's
- flexibility. There are new features that the catenet is
- scheduled to provide, most notably extended routing, in which the
- functional differences between links are recognized and accounted
- for. As described in IEN #86 "Extended Internet Routing", by
- Radia Perlman, a link state scheme must be adopted eventually in
- order for the catenet to provide this service.
-
- Thus the link state approach should be adopted to provide for
- network partitioning.
-
-
- XI. CONCLUSIONS
-
- A link state scheme, as originally presented in PRTN 242,
- modified as presented in part IV of this paper should be the
- basis of internet routing.
-
- The internet header should include a field long enough for a
- gateway ID, for the purpose of specifying a partition name. A
- partition name is the ID of any member gateway on that partition.
-
- The first gateway that handles a packet checks to see if it is
- addressed to a partitioned network. If so, and if the partition
- name field in the internet header is blank, the gateway sends
- back a special packet to the source host informing it that the
- network is partitioned and giving it a name for each partition of
- that network. When a gateway on the source net handles a packet
- for an unpartitioned network in which the partition name field is
- not blank, it erases that field and informs the source that that
- network is no longer partitioned.
-
- When a source host receives notice that a network is partitioned,
- it stores the partition names for that network, and when it
- wishes to send a packet to a host on that net, it first tries all
- partitions to determine the correct one. It keeps a cache of
- host/partition correspondence. When packets for a host in its
- cache no longer reach the destination, the source host should
- again attempt to determine the correct partition for that host.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- - 9 -
-
-
-
- APPENDIX
- COMBINING USUALLY SEPARATE NETWORKS
-
- In IEN 110, Dr. Vinton Cerf raises the possibility of combining
- nets, given that the catenet could handle a partitioned network.
- In general if the networks in question are usually partitioned,
- this is a bad idea, since there is overhead involved in having a
- partitioned network. Every time a source wishes to send a packet
- to a destination, someone must discover which partition to send
- the packet to.
-
- However, the specific example discussed in IEN 110 is an example
- where there is also a cost associated with not combining
- networks. In the example there are two ground PR nets, A and B.
- There are also a number of PRs on airplanes, call them P1, P2,
- ... Pn. When Pi is within range of a PR in net A, Pi
- automatically becomes a part of network A. When Pi is within
- range of both PR nets, the nets become a single PR net.
-
- Keeping the two nets separate leads to problems of addressing the
- airplane PRs, since the net on which they reside changes.
- Combining the two nets into a single network has the overhead of
- introducing a usually partitioned network into the catenet.
-
- There is a third solution to the particular case involved here.
- That is to keep networks A and B as separate logical networks,
- and to have P1, P2, ... Pn also as separate logical networks on
- the internet level. On the packet radio level there might be
- only one net, because one of the Pi connects nets A and B. But
- on the internet level there will be n+2 nets.
-
- A gateway on net A, called G1, will have a half gateway
- associated with each of the nets it might be "directly connected
- to" in the internet sense. In other words, it will have a half
- gateway for A, P1, ... Pn. The half gateway associated with
- network A determines whether its interface to net A is up or down
- depending on the state of the hardware ready line, etc., as is
- now done. The half gateway associated with "network" Pi must
- determine whether it is "connected" to its "network" by some
- other means. One method is to have a special querying packet
- containing the number i. The packet would be addressed, with a
- local header only, to Pi, and sent out the interface to network
- A. Pi's responsibility, upon seeing this querying packet, is to
- send back a special answering packet, also containing the number
- i. The half gateway associated with network A, upon receiving
- one of these special answering packets, uses the number contained
- in the packet to dispatch the packet to the half gateway
- associated with Pi. The half gateway associated with Pi, upon
- receiving this special answering packet, knows that its "network"
- is up.
-
- G1's list of neighbor gateways will include, besides all the
- gateways on net A, all the gateways on net B, since a gateway on
-
-
- - 10 -
-
-
-
- net B also has the Pi as potential attached networks. If some Pi
- connects nets A and B, then the gateways on A and B will all
- consider each other functional neighbors, and A, B, and the
- connected Pi, which have formed themselves into a single
- functional PR net, will function as a single net on the internet
- level, too. If one of the Pi is not within reach of either net A
- or net B, then all the gateways on nets A and B will report that
- they are not attached to net Pi, and all the gateways in the
- catenet will know Pi is unreachable. If A and B have not merged
- into one net (none of the Pi are in both nets), then the gateways
- on each will report which Pi are reachable from them, so the
- catenet will automatically route packets for Pi to the correct
- ground PR net.
-
- [It would be reasonable to include, in gateway G1, a half gateway
- for net B also, since if nets A and B merged, G1 would be
- connected to net B. However, it is not necessary to and is
- slightly more efficient not to, since even if nets A and B are
- merged, PRs in B are probably physically closer to the gateways
- on net B, so the catenet should route packets for PRs in B to the
- gateways that "really" are on ground net B. The advantage of
- including a half gateway for B in G1 is that net B could
- potentially partition in such a way that some partition included
- no gateways from B, but was reachable in the catenet via net A
- and some Pi. It is not obvious, however, what algorithm a half
- gateway for B should use to determine whether its "network" is
- up.]
-
- The airplane PR Pi does not think of itself as a network. From
- its point of view it is an ordinary PR. The only difference
- between Pi and an ordinary PR on net A is that Pi (or the TIU
- attached to Pi, if we want to strictly adhere to packet radio
- terminology) has stored as its internet address, Pi for its net
- number. It also has a list of possible gateways to use for
- internet packets. This list includes all the gateways on nets A
- and B. In the current PR net there is only one gateway, and all
- PRs know the ID of the gateway. This will change such that there
- will either be a special ID for an information service that will
- give out the ID of a gateway on the net (so that Pi, instead of
- keeping a list of gateways, could ask for a gateway address, as
- would the rest of the PRs on nets A and B) or all PRs will have
- assembled in a list of gateways, and they will need to probe each
- in turn until they find one that responds. Thus the only
- difference in Pi's finding a gateway and in an ordinary PR on net
- A finding a gateway, is that (assuming the assembled-in gateway
- list scheme is used) Pi's list will be longer, since it will also
- include the gateways on net B.
-
- There is obviously a cost associated with this solution, too. If
- the number of Pi are small, then this is a reasonable solution.
- If there are enough Pi, then the cost of having all those logical
- nets becomes greater than the cost of having an often partitioned
- network, so the solution of combining A, B, and all the Pi into
- one logical net in the catenet is a more practical solution.
-
- - 11 -
-
-